This dataset is publicly available for research. The details are described in [Cortez et al., 2009]. P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.
Available at:
The goal is to predict the quality ranking given by tasters from the various measured properties of red wines, in order to guide grape growers and wine producers regarding wine quality. Do some of these properties have a significant effect on quality? If so, which ones?
Input Variables:
fixed acidity (tartaric acid - g/dm^3): most acids involved with wine are fixed or nonvolatile (do not evaporate readily)
volatile acidity (acetic acid - g/dm^3): the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
citric acid (g/dm^3): found in small quantities, citric acid can add “freshness” and flavor to wines
residual sugar (g/dm^3): the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with greater than 45 grams/liter are considered sweet
chlorides (sodium chloride - g/dm^3): the amount of salt in the wine
free sulfur dioxide (mg/dm^3): the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
total sulfur dioxide (mg/dm^3): amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
density (g/cm^3): the density of wine is close to that of water, depending on the percent alcohol and sugar content
pH (scale between 0 and 14): describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
sulphates (potassium sulphate - g/dm^3): a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
alcohol (% by volume): the percent alcohol content of the wine
Output Variable (based on sensory data):
The dataframe is replaced by a subset of itself with the following modifications:
Added a column \(quality.f\) with the quality values as a factor type.
Removed the first column (row ID), which adds no value to the analysis.
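The two modifications above can be sketched as follows. This is a minimal illustration on a toy data frame, not the original preprocessing code: the ID column name `X` and the toy values are assumptions, while the ten factor levels mirror the `str()` output of the report.

```r
# Toy stand-in for the raw file; the row ID column name `X` is an assumption.
redwine <- data.frame(
  X = 1:6,
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 10.0),
  quality = c(5L, 5L, 5L, 6L, 5L, 7L)
)

# Add a factor version of quality, keeping all ten possible score levels.
redwine$quality.f <- factor(redwine$quality, levels = 1:10)

# Drop the row ID column.
redwine <- subset(redwine, select = -X)

str(redwine)
```

Declaring the levels explicitly keeps the unobserved scores (1, 2, 9, 10) visible as empty categories in later summaries.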
The structure of the dataframe and the variable data types:
## 'data.frame': 1599 obs. of 13 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ quality.f : Factor w/ 10 levels "1","2","3","4",..: 5 5 5 6 5 5 5 7 7 5 ...
The dataset consists of 13 variables with 1599 observations. There is an additional variable \(quality.f\), created as a factor of the quality scores, which will be used to create a model. The variable \(quality\) is of integer type, \(quality.f\) is of factor type, and the rest are numeric.
Descriptive statistics of every variable in the dataset.
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
##
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
##
## quality quality.f
## Min. :3.000 5 :681
## 1st Qu.:5.000 6 :638
## Median :6.000 7 :199
## Mean :5.636 4 : 53
## 3rd Qu.:6.000 8 : 18
## Max. :8.000 3 : 10
## (Other): 0
The summary statistics above include the minimum, maximum, mean, median, and first and third quartiles. They reveal that the mean of most variables is greater than the median, which suggests right-skewed distributions, possibly with high outliers. Only \(density\) and \(pH\) have a median about the same as the mean, which is consistent with a roughly symmetric, possibly normal, distribution. The \(quality\) minimum is 3 and the maximum is 8, which might indicate that our dataset doesn’t include any measurements of the worst or best quality wines. Variables \(residual.sugar\), \(chlorides\), \(free.sulfur.dioxide\) and \(total.sulfur.dioxide\) have outliers very far away, since the max values are well above the 3rd quartile.
Note: To save space, no measurement units are indicated in the plots, charts or graphs that follow in the analysis. Please refer to the table below if needed.
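As a rough check of the claim about far-away maxima, the quartiles and maxima from the summary table can be compared against an upper fence. The sketch below assumes the common 1.5 * IQR rule (the report does not state which rule it uses) and copies the numbers from the summary output rather than reading the data.

```r
# Quartiles and maxima copied from the summary table in the report.
stats <- data.frame(
  variable = c("residual.sugar", "chlorides",
               "free.sulfur.dioxide", "total.sulfur.dioxide"),
  q1  = c(1.9, 0.07, 7, 22),
  q3  = c(2.6, 0.09, 21, 62),
  max = c(15.5, 0.611, 72, 289)
)

# Upper fence under the (assumed) 1.5 * IQR rule.
stats$upper_fence <- stats$q3 + 1.5 * (stats$q3 - stats$q1)

# TRUE when the maximum lies beyond the fence.
stats$max_is_outlier <- stats$max > stats$upper_fence

print(stats)
```

For all four variables the maximum lies beyond the fence, which agrees with the observation in the text.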
## var_name measure_units
## 1 fixed.acidity g/dm^3
## 2 volatile.acidity g/dm^3
## 3 citric.acid g/dm^3
## 4 residual.sugar g/dm^3
## 5 chlorides g/dm^3
## 6 free.sulfur.dioxide mg/dm^3
## 7 total.sulfur.dioxide mg/dm^3
## 8 density g/cm^3
## 9 pH scale 0-14
## 10 sulphates g/dm^3
## 11 alcohol %
## 12 quality scale 1-10
## 13 quality.f scale 1-10
Histograms and a bar plot are used to explore the distribution of each explanatory variable. I am not sure if the variables are completely independent.
Figure 1.
As shown in Figure 1, the \(quality\) variable has most values concentrated in categories 5 and 6. Only a small proportion falls in the remaining categories. There are no values in categories 1, 2, 9 and 10. Variables \(residual.sugar\), \(free.sulfur.dioxide\), \(total.sulfur.dioxide\) and \(sulphates\) have positively skewed distributions. \(alcohol\) and \(citric.acid\) have irregularly shaped distributions. \(density\) and \(pH\) appear to be normally distributed.
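The visual impression of skew can be backed by a number. The sketch below defines a simple moment-based skewness helper (the name `skewness` and the simulated inputs are illustrative assumptions, not part of the original analysis) and applies it to simulated stand-ins for a right-skewed and a symmetric variable.

```r
# Moment-based sample skewness: positive values indicate a longer right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

set.seed(1)
right_skewed <- rlnorm(1000)  # simulated stand-in for a variable like residual.sugar
symmetric    <- rnorm(1000)   # simulated stand-in for a variable like pH

skewness(right_skewed)
skewness(symmetric)
```

On the real data the same helper could be applied column by column, for example with `sapply()` over the numeric columns.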
Boxplots for each of the explanatory variables.
Figure 2.
The boxplots in Figure 2 show the distributions of the variables from a different angle. I can see that all variables have outliers. \(free.sulfur.dioxide\) and \(density\) have a few outliers far away from most other observations. Variables \(fixed.acidity\) and \(volatile.acidity\) have a lot of outliers. Variables \(alcohol\) and \(citric.acid\) don’t have pronounced outliers. Variables \(quality\), \(density\) and \(pH\) have an approximately normal distribution. The distributions of \(sulphates\), \(residual.sugar\) and \(chlorides\) are very heavily skewed.
To get an overview of the relationships between variables, I produced a pairwise comparison of the explanatory variables of the dataset. The column \(quality.f\) is dropped as it is a factor type variable. The graph provides two different comparisons of each pair of columns and displays the color-encoded correlation coefficient of the respective variables. The legend displays 8 levels of the coefficient from -1 to +1.
Figure 3.
The plot in Figure 3 provides us with a very general idea of the correlations between variables. I picked some pairs with the highest correlation numbers (the two darkest colors) to do some more detailed analysis.
Scatterplots pair up the more interesting input variables in the data set, with an added smoothed conditional mean, which helps in seeing patterns when overplotting occurs.
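Picking the strongest pairs can also be done numerically instead of by color. The following is a minimal sketch on simulated columns (`a`, `b`, `c` are hypothetical names, not wine variables) that ranks all variable pairs by absolute correlation.

```r
set.seed(42)
n <- 200
toy <- data.frame(a = rnorm(n))
toy$b <- toy$a + rnorm(n, sd = 0.3)  # strongly related to a
toy$c <- rnorm(n)                    # unrelated noise

cm <- cor(toy)  # Pearson correlation matrix (the default method)

# Keep each pair once by taking the upper triangle of the matrix.
idx <- which(upper.tri(cm), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  cor  = cm[upper.tri(cm)]
)

# Strongest relationships first, regardless of sign.
cor_pairs <- cor_pairs[order(-abs(cor_pairs$cor)), ]
print(cor_pairs)
```

With the wine data, `toy` would be replaced by the numeric columns of the data frame.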
Figure 4.
Total acidity is divided into two groups, namely the volatile acids and the nonvolatile or fixed acids. One of the predominant fixed acids found in wines is citric acid. So it is not a surprise to see a strong correlation between \(fixed.acidity\) and \(citric.acid\).
There is a moderate negative correlation between \(volatile.acidity\) and \(citric.acid\). The disadvantage of adding citric acid is its microbial instability. In the European Union, use of citric acid for acidification is prohibited.
The term “sulfites” is an inclusive term for sulfur dioxide (SO2). SO2 is a preservative and widely used in winemaking because of its antioxidant and antibacterial properties. A small amount of sulfites is produced naturally as a byproduct of fermentation, but most of the SO2 has been added by the winemaker.
Figure 5.
Total sulfur dioxide is divided into two groups: free sulfur dioxide and bound sulfur dioxide. So, again, it is clear why \(free.sulfur.dioxide\) and \(total.sulfur.dioxide\) have a strong correlation. See Figure 5.
The measure of the amount of acidity in wine is known as the “titratable acidity” or “total acidity”, which refers to the test that yields the total of all acids present, while strength of acidity is measured according to pH, with most wines having a pH between 2.9 and 3.9.
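Because pH is a base-10 logarithmic scale, the seemingly narrow 2.9 to 3.9 range covers a tenfold difference in acid strength. The sketch below uses the textbook approximation of pH as the negative log10 of the hydrogen ion concentration; the helper name `h_conc` is only illustrative.

```r
# pH is approximately -log10 of the hydrogen ion concentration (mol/L),
# so the concentration can be recovered by inverting that relation.
h_conc <- function(pH) 10^(-pH)

# Ratio of concentrations for two wines one pH unit apart.
h_conc(2.9) / h_conc(3.9)
```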
Figure 6.
The plot in Figure 6 shows a strong negative correlation between \(fixed.acidity\) and \(pH\).
To get an overview and a better understanding of the relationships between the output variable and all input variables, I produced scatterplots pairing up all explanatory variables with the main feature \(quality.f\).
Figure 7.
From the plot, it does look like \(fixed.acidity\) and \(quality\) have a slight positive correlation. A small number of wines of average quality (5) have extremely high acidity. The mean for all quality levels is greater than the median, so the \(fixed.acidity\) distribution has a positive skew.
Figure 9.
The variable \(volatile.acidity\) might have a fairly even distribution and a moderate negative correlation. There are some larger outliers for wines with quality level 5. The mean for all quality levels is greater than or equal to the median, so the distribution must be positively skewed.
Figure 10.
The variable \(citric.acid\) might have a fairly even distribution and a positive correlation. There are 2 very high outliers for wines with quality level 4.
Figure 11.
Looking at the histogram in Figure 1, it seems the variable \(residual.sugar\) is heavily right-skewed. To better understand the data, the boxplot is produced with a logarithmic transformation applied. The result shows a very slight correlation and a lot of outliers in quality categories 5-7.
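The effect of the logarithmic transformation on a right-skewed variable can be illustrated by counting boxplot outliers before and after it. This is a sketch on simulated log-normal values, not on the wine data, and the base-10 logarithm is an assumption (the report does not say which base was used).

```r
set.seed(7)
# Simulated right-skewed values, standing in for a variable like residual.sugar.
x <- rlnorm(500, meanlog = 1, sdlog = 0.8)

# boxplot.stats()$out returns the points drawn beyond the whiskers.
n_out_raw <- length(boxplot.stats(x)$out)
n_out_log <- length(boxplot.stats(log10(x))$out)

c(raw = n_out_raw, log10 = n_out_log)
```

The transformation compresses the long right tail, so fewer points are flagged and the boxes become easier to compare across quality levels.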
Figure 12.
This plot of \(chlorides\) is also produced with a logarithmic transformation applied. The result shows a lot of outliers in quality categories 5-6, with a few outliers at level 7, and a very weak negative correlation.
Figure 13.
The mean of \(free.sulfur.dioxide\) for all quality levels is greater than the median, so the distribution must be positively skewed. There is a very slight negative correlation.
Figure 14.
To get a better view, the chart is produced with a logarithmic transformation applied. It reveals a weak negative correlation between the variables \(quality\) and \(total.sulfur.dioxide\).
Figure 15.
Variable \(density\) has a very small range (0.9901-1.0037), with outliers placed about equally at both ends of the scale. The distribution is approximately normal.
Figure 16.
The distribution appears normal, with very few outliers, mostly located in \(quality\) levels 5-7.
Figure 17.
After applying the logarithmic transformation, the plot reveals a lot of outliers in wines of average quality at level 5 and a positive correlation.
Figure 18.
From the plot, it appears the correlation is strongly positive. The distribution of the amount of alcohol between levels 5 and 6 is interesting: the 75th percentile of alcohol at level 5 is lower than the median at level 6.
The plots and analysis of the explanatory variables vs. the response variable revealed some insight into the data. I think it is best to compute both Spearman’s and Pearson’s correlations, since the relation between them might give some information. Spearman’s coefficient is computed on ranks and so depicts monotonic relationships, while Pearson’s is computed on the true values and depicts linear relationships.
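The difference between the two coefficients is easiest to see on a relationship that is monotonic but not linear. The sketch below uses made-up values (`x` and `y` are not wine variables) and the `method` argument of `cor()`.

```r
x <- 1:20
y <- exp(x / 4)  # strictly increasing in x, but clearly nonlinear

# Pearson measures linear association on the raw values.
cor(x, y, method = "pearson")

# Spearman measures association between the ranks, so any strictly
# increasing relationship gives a perfect score.
cor(x, y, method = "spearman")
```

When Spearman's rho is noticeably larger in magnitude than Pearson's r for a pair, the relationship is likely monotonic but curved.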
## # A tibble: 6 × 2
## cor pair
## <dbl> <chr>
## 1 -0.68297819 redwine$fixed.acidity and redwine$pH
## 2 0.67170343 redwine$fixed.acidity and redwine$citric.acid
## 3 -0.55249568 redwine$volatile.acidity and redwine$citric.acid
## 4 0.66804729 redwine$fixed.acidity and redwine$density
## 5 0.04207544 redwine$residual.sugar and redwine$alcohol
## 6 0.47616632 redwine$alcohol and redwine$quality
The Pearson product-moment correlation coefficient is a measure of the strength of the linear relationship between two variables. The data shown in the table above are the Pearson correlation coefficients and the corresponding pairs of variables. The numbers support our previous observations about the relationships between the picked variables.
## # A tibble: 6 × 2
## rho pair
## <dbl> <chr>
## 1 -0.7066736 redwine$fixed.acidity and redwine$pH
## 2 0.6617084 redwine$fixed.acidity and redwine$citric.acid
## 3 -0.6102595 redwine$volatile.acidity and redwine$citric.acid
## 4 0.6230708 redwine$fixed.acidity and redwine$density
## 5 0.1165481 redwine$residual.sugar and redwine$alcohol
## 6 0.4785317 redwine$alcohol and redwine$quality
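The data shown in the table above are the Spearman rho coefficients and the corresponding pairs of variables. The strongest negative correlation is between \(fixed.acidity\) and \(pH\); the strongest positive correlation is for the \(fixed.acidity\) and \(citric.acid\) pair.
I use logistic regression, in which the log odds of the outcome are modeled as a linear combination of the predictor variables. I begin the analysis by including all 11 input variables as main effects; the formula contains no interaction terms. Because the model is fitted with glm() and family = binomial, it is a binary rather than a multinomial or ordinal model: for a factor response, R treats the first level as failure and all other levels as success, so here the lowest observed quality level (3) is contrasted with all higher levels.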
##
## Call:
## glm(formula = quality.f ~ fixed.acidity + volatile.acidity +
## citric.acid + residual.sugar + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + density + pH + sulphates + alcohol,
## family = binomial(link = "logit"), data = redwine)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.96787 0.00752 0.02341 0.05730 1.22446
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 494.89510 523.99352 0.944 0.344931
## fixed.acidity -0.28791 0.67009 -0.430 0.667446
## volatile.acidity -8.40765 2.50988 -3.350 0.000809 ***
## citric.acid -3.70698 3.92708 -0.944 0.345195
## residual.sugar 0.14205 0.29387 0.483 0.628827
## chlorides -13.03262 7.00680 -1.860 0.062886 .
## free.sulfur.dioxide -0.15367 0.08888 -1.729 0.083823 .
## total.sulfur.dioxide 0.09925 0.04981 1.992 0.046322 *
## density -470.45027 533.58716 -0.882 0.377953
## pH -8.01302 4.80305 -1.668 0.095253 .
## sulphates 2.69403 3.47425 0.775 0.438088
## alcohol 1.32310 0.77934 1.698 0.089563 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 121.428 on 1598 degrees of freedom
## Residual deviance: 69.165 on 1587 degrees of freedom
## AIC: 93.165
##
## Number of Fisher Scoring iterations: 10
The logistic regression model result table marks the most influential variables for quality with significance symbols next to the p-values. \(volatile.acidity\) has the lowest p-value, 0.000809, and is marked with 3 stars “***”.
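It is worth checking what `glm()` with `family = binomial` does with a multi-level factor response like \(quality.f\). The sketch below uses simulated data (the three-level `score` is hypothetical) to show that the fit is identical to an explicit binary coding of "first level vs. everything else". It then recomputes the null deviance for 10 "failures" out of 1599 observations, the count of quality level 3 in the summary, which reproduces the reported 121.428 to within rounding.

```r
set.seed(3)
n <- 300
x <- rnorm(n)
# Three-level toy response, loosely related to x (simulated, not the wine data).
score <- cut(x + rnorm(n), breaks = c(-Inf, -1, 1, Inf),
             labels = c("low", "mid", "high"))

fit_factor <- glm(score ~ x, family = binomial(link = "logit"))

# Explicit coding: first level = 0 (failure), any other level = 1 (success).
not_lowest <- as.integer(score != levels(score)[1])
fit_binary <- glm(not_lowest ~ x, family = binomial(link = "logit"))

# TRUE: both calls fit the same binary model.
all.equal(unname(coef(fit_factor)), unname(coef(fit_binary)))

# Null deviance of an intercept-only binary model with 10 failures out of 1599.
null_dev <- -2 * (10 * log(10 / 1599) + 1589 * log(1589 / 1599))
null_dev
```

For a model over all quality levels, an ordinal or multinomial fit (for example `MASS::polr()` or `nnet::multinom()`) would be the usual alternative; neither is used in this report.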
To select a set of predictor variables from the set I performed the Stepwise Variable Selection. This is one of the available options to confirm the previous findings.
##
## Call:
## glm(formula = quality.f ~ volatile.acidity + citric.acid + free.sulfur.dioxide +
## total.sulfur.dioxide + density + pH + alcohol, family = binomial(link = "logit"),
## data = redwine)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.2580 0.0076 0.0244 0.0607 1.1836
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 474.57117 267.88706 1.772 0.0765 .
## volatile.acidity -9.64610 2.08812 -4.620 3.85e-06 ***
## citric.acid -5.89262 3.00918 -1.958 0.0502 .
## free.sulfur.dioxide -0.14989 0.08004 -1.873 0.0611 .
## total.sulfur.dioxide 0.10963 0.04756 2.305 0.0212 *
## density -458.93682 266.23979 -1.724 0.0847 .
## pH -6.41360 3.57637 -1.793 0.0729 .
## alcohol 1.59324 0.67825 2.349 0.0188 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 121.428 on 1598 degrees of freedom
## Residual deviance: 72.299 on 1591 degrees of freedom
## AIC: 88.299
##
## Number of Fisher Scoring iterations: 10
The selected variables, p-values and significance codes vary slightly from the full logistic regression model results, but they confirm the general trend. First of all, I can see that 4 of the 11 input variables (\(fixed.acidity\), \(residual.sugar\), \(chlorides\) and \(sulphates\)) were dropped by the stepwise selection as not statistically significant.
Of the statistically significant variables \(total.sulfur.dioxide\), \(alcohol\) and \(volatile.acidity\), the last has the lowest p-value (3.85e-06), suggesting a strong association with the probability of a wine having higher quality. The negative coefficient for this predictor suggests that, all other variables being equal, a wine with more \(volatile.acidity\) is less likely to have higher quality.
From the variable selection table I can see that \(volatile.acidity\) and \(alcohol\) have the lowest p-values, so in this dataset they might have the greatest influence on the final \(quality\) result.
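A stepwise selection of this kind can be sketched with `step()` from the stats package. Whether the original analysis used `step()` is an assumption, and the data below are simulated, with two informative predictors and two pure-noise predictors (all names hypothetical).

```r
set.seed(11)
n <- 400
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
# Binary outcome that truly depends only on x1 and x2.
toy$y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * toy$x1 - 0.8 * toy$x2))

full <- glm(y ~ x1 + x2 + noise1 + noise2,
            family = binomial(link = "logit"), data = toy)

# step() drops or re-adds one term at a time, keeping the change that
# lowers AIC the most, and stops when no change improves AIC.
reduced <- step(full, direction = "both", trace = 0)

formula(reduced)
c(full = AIC(full), reduced = AIC(reduced))
```

Because the search is driven by AIC rather than by p-values, a retained term can still have a p-value above 0.05, as several terms in the output above do.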
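Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios that are easier to read. The sketch below copies three coefficients from the stepwise output; the step sizes in `delta` are illustrative choices, not values from the report, and the odds refer to the outcome as coded by the fitted model.

```r
# Coefficients copied from the stepwise model output (log-odds scale).
coefs <- c(volatile.acidity = -9.64610,
           alcohol = 1.59324,
           total.sulfur.dioxide = 0.10963)

# Illustrative step sizes: 0.1 g/dm^3, 1 % by volume, 10 mg/dm^3.
delta <- c(volatile.acidity = 0.1,
           alcohol = 1,
           total.sulfur.dioxide = 10)

# Multiplicative change in the odds for each step, other variables held fixed.
odds_ratio <- exp(coefs * delta)
round(odds_ratio, 3)
```

A ratio below 1 (volatile acidity) means the odds fall as the variable rises; a ratio above 1 (alcohol) means they rise.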
Figure 19.
In Figure 19, the plot of the distribution of \(volatile.acidity\) vs \(alcohol\) reveals quite clearly the clustering by color-coded quality levels. The lowest quality wines have higher volatile acidity and lower alcohol levels. The highest quality wines have higher alcohol levels and slightly lower volatile acidity.
Figure 20.
Summary of the \(quality.f\) variable
## 1 2 3 4 5 6 7 8 9 10
## 0 0 10 53 681 638 199 18 0 0
As shown in the histogram in Figure 20 and in the summary, the \(quality.f\) variable has most values concentrated in categories 5 and 6. Only a small proportion falls in the remaining categories. There are no values in categories 1, 2, 9 and 10. That means the sample of tested wines did not include any very bad or very good wines. This makes me question the credibility of the data set.
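The concentration in the middle categories can be quantified directly from the counts in the summary. The sketch below rebuilds the scores from those counts instead of reading the data file.

```r
# Rebuild the quality scores from the counts in the summary.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))
quality.f <- factor(quality, levels = 1:10)

# Unused levels (1, 2, 9, 10) appear with a count of 0.
counts <- table(quality.f)
counts

# Share of wines rated 5 or 6.
share_5_6 <- sum(counts[c("5", "6")]) / sum(counts)
round(share_5_6, 3)
```

Roughly five out of six wines fall in categories 5 and 6, so any model of the extreme categories rests on very few observations.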
Figure 21.
As shown in Figure 21, the \(free.sulfur.dioxide\) and \(total.sulfur.dioxide\) variables show the strongest correlation among all wine parameters (see Spearman Rank Correlation table), equal to 0.789.
From the chart, it does look like there might be a threshold of about 100 for higher quality wines. But I’m not sure that the chart shows that low quality wines have higher sulfur dioxide. Most of the low quality wine is clustered in the upper or lower portion of the graph, while high quality wine is around the mid-left region.
Figure 22.
The \(volatile.acidity\) of the wines is one of the best predictors of quality. Given the clustering seen in Figure 22, we might say that the \(quality\) of a red wine can be predicted from its \(volatile.acidity\) and \(alcohol\) values. The best quality wines have lower levels of volatile acidity and an alcohol level above 10. Regression lines depict the separation for different quality ratings.
Wine chemistry explains the flavor, balance and color of wine. My exploration and analysis of the red wine dataset started with looking for more information on the basics of wine chemistry, the fermentation process, and the additives that help to improve the quality of wines. My biggest struggles working on this project were 1) selecting testing methods and predictive models suited to my data type, since regression analysis includes many techniques for modeling, and 2) the actual analysis, that is, interpreting and describing the results of the plots. I think the class could present a bigger variety of samples, quizzes or assignments, or maybe one part of a lesson could be dedicated to an overview of all available methods and techniques, covering when and with what kind of data each could be used, without going into too much detail.
My conclusion: the tasters’ decisions on wine quality levels are based on their personal tastes. Only very few variables have a strong correlation with the quality of wine. A widely accepted notion in the wine industry is that the balance of taste and chemical ingredients is as follows:
Sweet Taste (sugars + alcohols) <=> Acid Taste (acids) + Bitter Taste (phenols)
Can we draw any conclusion about the relationship between the quality and the chemical compounds in wine, since we are presented with measurements of only a small portion of elements: only a handful of elements of the acid group and no elements of the phenol group?
Also, as the quality levels of our dataset show, the sample of tested wines did not include any very low or very high quality wines. It might mean the sample is not random, which makes me question the analysis and any of my findings, which might well be inaccurate.
I take this analysis as good practice for learning the R language and RStudio, and for deepening my knowledge of statistics.